Method and apparatus for scanning a web site in a distributed data processing system for problem determination

ABSTRACT

A method and apparatus for identifying problems associated with a web site. A scan of a web site is initiated by a plurality of agents, wherein each of the plurality of agents is at a different location in the distributed data processing system. Results of the scan are obtained from the plurality of agents. The results of the scan are analyzed to determine if a problem is associated with the web site.

BACKGROUND OF THE INVENTION

[0001] 1. Technical Field

[0002] The present invention relates generally to an improved distributed data processing system and in particular to a method and apparatus for identifying problems within a distributed data processing system. Still more particularly, the present invention provides a method and apparatus for scanning computers within a distributed data processing system for problem determination.

[0003] 2. Description of Related Art

[0004] The Internet, also referred to as an “internetwork”, is a set of computer networks, possibly dissimilar, joined together by means of gateways that handle data transfer and the conversion of messages from the sending network to the protocols used by the receiving network (with packets if necessary). When capitalized, the term “Internet” refers to the collection of networks and gateways that use the TCP/IP suite of protocols.

[0005] The Internet has become a cultural fixture as a source of both information and entertainment. Many businesses are creating Internet sites as an integral part of their marketing efforts, informing consumers of the products or services offered by the business or providing other information seeking to engender brand loyalty. Many federal, state, and local government agencies are also employing Internet sites for informational purposes, particularly agencies which must interact with virtually all segments of society such as the Internal Revenue Service and secretaries of state. Providing informational guides and/or searchable databases of online public records may reduce operating costs. Further, the Internet is becoming increasingly popular as a medium for commercial transactions.

[0006] Currently, the most commonly employed method of transferring data over the Internet is to employ the World Wide Web environment, also called simply “the Web”. Other Internet resources exist for transferring information, such as File Transfer Protocol (FTP) and Gopher, but have not achieved the popularity of the Web. In the Web environment, servers and clients effect data transaction using the Hypertext Transfer Protocol (HTTP), a known protocol for handling the transfer of various data files (e.g., text, still graphic images, audio, motion video, etc.). Information is formatted for presentation to a user by a standard page description language, the Hypertext Markup Language (HTML). In addition to basic presentation formatting, HTML allows developers to specify “links” to other Web resources identified by a Uniform Resource Locator (URL). A URL is a special syntax identifier defining a communications path to specific information. Each logical block of information accessible to a client, called a “page” or a “Web page”, is identified by a URL. The URL provides a universal, consistent method for finding and accessing this information, not necessarily for the user, but mostly for the user's Web “browser”. A browser is a program capable of submitting a request for information identified by a URL at the client machine. Retrieval of information on the Web is generally accomplished with an HTML-compatible browser that browses web sites. A web site is a group of related HTML documents and associated files, scripts, and databases that is served up by an HTTP server on the World Wide Web. The HTML documents in a web site generally cover one or more related topics and are interconnected through hyperlinks. Most web sites have a home page as their starting point, which frequently functions as a table of contents for the site. Many large organizations, such as corporations, will have one or more HTTP servers dedicated to a single web site. However, an HTTP server can also serve several small web sites, such as those owned by individuals.

[0007] The Internet also is widely used to transfer applications to users using browsers. With respect to commerce on the Web, individual consumers and businesses use the Web to purchase various goods and services. In offering goods and services, some companies offer goods and services solely on the Web while others use the Web to extend their reach.

[0008] Users exploring the Web have discovered that the content supported by the HTML document format on the Web is too limited. Users desire an ability to access applications and programs, but applications were targeted towards specific types of platforms. As a result, not everyone could access applications or programs. This deficiency has been minimized through the introduction and use of programs known as “applets”, which may be embedded as objects in HTML documents on the Web. Applets are Java programs that may be transparently downloaded into a browser supporting Java along with the HTML pages in which they appear. These Java programs are network and platform independent. Applets run the same way regardless of where they originate or onto what data processing system they are loaded.

[0009] Java™ is an object oriented programming language and environment focusing on defining data as objects and the methods that may be applied to those objects. Java provides a mechanism to distribute software and extends the capabilities of a web browser because programmers can write an applet once and the applet can be run on any Java enabled machine on the Web.

[0010] With these features on the Internet and especially on the Web, e-commerce activities are becoming more and more important to various companies. Extended enterprises are becoming more common, in which an extended enterprise is made up of customers, suppliers, distributors, and other business partners with whom a company conducts online business. With this increased e-commerce activity on distributed data processing systems, such as the Internet, it is important to ensure that web resources are available and to enable applications and information to be distributed and maintained across the extended enterprise.

[0011] Companies control the resources, systems, networks, and applications within their own enterprises. Business practices, however, have changed. Enterprises are increasingly becoming extended enterprises. The advent of the Internet has enabled companies to open their e-commerce “doors” to allow customers, suppliers, and distributors to share critical information online, in order to more efficiently conduct business with a wider range of partners. As a consequence, conducting business on the Internet means that companies must rely on a myriad of relationships, not only with their trading partners but also with multiple Internet service providers (ISPs), to conduct business transactions. It is important to identify problems with particular servers or web sites located on servers to ensure that e-commerce activities can be conducted without failure or without delays.

[0012] Therefore, it would be advantageous to have an improved method and apparatus for identifying problems on servers and web sites in a distributed data processing system.

SUMMARY OF THE INVENTION

[0013] A method and apparatus for identifying problems associated with a web site. A scan of a web site is initiated by a plurality of agents, wherein each of the plurality of agents is at a different location in the distributed data processing system. Results of the scan are obtained from the plurality of agents. The results of the scan are analyzed to determine if a problem is associated with the web site.

BRIEF DESCRIPTION OF THE DRAWINGS

[0014] The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objectives and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:

[0015] FIG. 1 depicts a pictorial representation of a distributed data processing system in which the present invention may be implemented;

[0016] FIG. 2 is a block diagram depicting a data processing system, which may be implemented as a server, in accordance with a preferred embodiment of the present invention;

[0017] FIG. 3 is a block diagram illustrating a data processing system in which the present invention may be implemented;

[0018] FIG. 4 is an illustration of a scan policy in accordance with a preferred embodiment of the present invention;

[0019] FIG. 5 is a diagram illustrating a data structure returning results from a scan, one per URL, in accordance with a preferred embodiment of the present invention;

[0020] FIG. 6 is a flowchart of a process used by a server to initiate a scan of a web site or other target server in accordance with a preferred embodiment of the present invention;

[0021] FIG. 7 is a high-level flowchart of a process employed by an agent to scan a target web site for a server in accordance with a preferred embodiment of the present invention;

[0022] FIG. 8 is a flowchart of a process for analyzing problems on web sites in accordance with a preferred embodiment of the present invention;

[0023] FIG. 9 is a flowchart of a process for statistical analysis of a data stream in accordance with a preferred embodiment of the present invention;

[0024] FIG. 10 is a flowchart of a process for analyzing DNS performance in accordance with a preferred embodiment of the present invention; and

[0025] FIG. 11 is a flowchart of a process employed by an agent to scan a web site in accordance with a preferred embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

[0026] The present invention provides a method, apparatus, and instructions for conducting a scan of a server or web site from at least two different locations so that the results can be correlated and the network impacts on the scan can be taken into account. When a problem is detected on one scan but not on the other scan, heuristics may be employed to determine whether the problem is a network effect or a real problem. When scans are performed from a fair number of locations, a majority rules policy may be implemented in making determinations and coming to conclusions. Alternatively, a discrepancy in scan results may indicate that a second more thorough scan should be initiated or that scans from other specific network locations would be helpful in problem determination. Additionally, multiple scans over time may be used to create a historical database to identify intermittent problems originating from one or more network locations.
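
By way of illustration only, the majority rules policy mentioned above might be sketched as follows. The function name `majority_verdict`, the agent names, and the outcome labels "ok" and "timeout" are hypothetical and do not appear in the embodiment; the sketch further assumes that each agent reduces its scan to a single outcome label.

```python
from collections import Counter


def majority_verdict(observations):
    """Return the outcome reported by a strict majority of agents.

    `observations` maps a (hypothetical) agent name to the outcome that
    agent reported. None is returned when no outcome has a strict
    majority, which could be a cue to initiate a more thorough scan.
    """
    if not observations:
        return None
    outcome, count = Counter(observations.values()).most_common(1)[0]
    return outcome if count > len(observations) / 2 else None


# Three agents at different locations scan the same web site.
reports = {"agent_a": "ok", "agent_b": "ok", "agent_c": "timeout"}
verdict = majority_verdict(reports)

# Agents that disagree with the majority point at a likely network effect.
dissenters = sorted(name for name, seen in reports.items() if seen != verdict)
```

In this sketch a tie yields no verdict, rather than an arbitrary one, so that a discrepancy between two agents is left for further scans to resolve.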

[0027] With reference now to the figures, FIG. 1 depicts a pictorial representation of a distributed data processing system in which the present invention may be implemented. Distributed data processing system 100 is a network of computers in which the present invention may be implemented. Distributed data processing system 100 contains a network 102, which is the medium used to provide communications links between various devices and computers connected together within distributed data processing system 100. Network 102 may include permanent connections, such as wire or fiber optic cables, or temporary connections made through telephone connections.

[0028] In the depicted example, a server 104 is connected to network 102 along with storage unit 106. In addition, clients 108, 110, and 112 also are connected to network 102. These clients 108, 110, and 112 may be, for example, personal computers or network computers. For purposes of this application, a network computer is any computer, coupled to a network, which receives a program or other application from another computer coupled to the network. In the depicted example, server 104 provides data, such as boot files, operating system images, and applications to clients 108-112. Clients 108, 110, and 112 are clients to server 104. Server 104 may be a server or computer that is used to initiate scans of a target resource using clients 108, 110, and 112. These clients are also referred to as “agents” when used to initiate scans of a target resource. In this example, server 114 also is connected to network 102 and may be a target resource that is scanned by clients 108, 110, and 112 acting as agents. Server 114 as illustrated is a web server. A web server, also referred to as an HTTP server, is server software that uses HTTP to serve up HTML documents and any associated files and scripts when requested by a client, such as a web browser. The connection between client and server is usually broken after the requested document or file has been served. HTTP servers are used on Web and Intranet sites. A target resource that is scanned by agents may include other resources in addition to a web server. For example, without limitation, a target resource may be a web site, a file transfer protocol (FTP) site, or a domain name system (DNS) server. The results from these scans may be returned to server 104 for analysis or alternatively to another server or other computer. In other words, the computer initiating the scan is not necessarily the computer that will perform the analysis of the results returned by the agents.

[0029] Distributed data processing system 100 may include additional servers, clients, and other devices not shown. In the depicted example, distributed data processing system 100 is the Internet with network 102 representing a worldwide collection of networks and gateways that use the TCP/IP suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, government, educational and other computer systems that route data and messages. Of course, distributed data processing system 100 also may be implemented as a number of different types of networks, such as for example, an intranet, a local area network (LAN), or a wide area network (WAN). FIG. 1 is intended as an example, and not as an architectural limitation for the present invention.

[0030] Referring to FIG. 2, a block diagram depicts a data processing system, which may be implemented as a server, such as server 104 in FIG. 1, in accordance with a preferred embodiment of the present invention. This data processing system is an example of a computer that may be used to initiate scans of a resource by two or more agents and to analyze results returned by the agents scanning the resource. More specifically, data processing system 200 is used to collect and analyze data collected from scans of selected resources, such as web sites, by two or more agents at different locations in the network. Server 104 is able to identify problems associated with resources by analyzing the data from the scans. For example, broken links, slow links, slow response times, and authorization failures are examples of problems that may be identified through scans. Although in this example, the computer used to initiate the scan and analyze the results is a server, other types of data processing systems also may be used to initiate and/or analyze results.

[0031] Data processing system 200 may be a symmetric multiprocessor (SMP) system including a plurality of processors 202 and 204 connected to system bus 206. Alternatively, a single processor system may be employed. Also connected to system bus 206 is memory controller/cache 208, which provides an interface to local memory 209. I/O bus bridge 210 is connected to system bus 206 and provides an interface to I/O bus 212. Memory controller/cache 208 and I/O bus bridge 210 may be integrated as depicted.

[0032] Peripheral component interconnect (PCI) bus bridge 214 connected to I/O bus 212 provides an interface to PCI local bus 216. Typical PCI bus implementations will support four PCI expansion slots or add-in connectors. Communications links to network computers 108-112 in FIG. 1 may be provided through modem 218 and network adapter 220 connected to PCI local bus 216 through add-in boards.

[0033] Additional PCI bus bridges 222 and 224 provide interfaces for additional PCI buses 226 and 228, from which additional modems or network adapters may be supported. In this manner, server 200 allows connections to multiple network computers. A memory-mapped graphics adapter 230 and hard disk 232 may also be connected to I/O bus 212 as depicted either directly or indirectly.

[0034] Those of ordinary skill in the art will appreciate that the hardware depicted in FIG. 2 may vary. For example, other peripheral devices, such as optical disk drives and the like, also may be used in addition to or in place of the hardware depicted. The depicted example is not meant to imply architectural limitations with respect to the present invention.

[0035] The data processing system depicted in FIG. 2 may be, for example, an IBM RISC/System 6000 system, a product of International Business Machines Corporation in Armonk, New York, running the Advanced Interactive Executive (AIX) operating system.

[0036] With reference now to FIG. 3, a block diagram illustrates a data processing system in which the present invention may be implemented. Data processing system 300 is an example of a client computer, which may be used as an agent to perform scans on a target resource. In this example, data processing system 300 may be used to scan a target resource by performing various tests or sending various requests to the target resource. For example, data processing system 300 may be used to access a web site, traverse various links located within the web site, and retrieve documents or other resources from the web site. The data and statistics gathered from a scan are returned to a server or other computer for analysis.

[0037] Data processing system 300 employs a peripheral component interconnect (PCI) local bus architecture. Although the depicted example employs a PCI bus, other bus architectures such as Micro Channel and ISA may be used. Processor 302 and main memory 304 are connected to PCI local bus 306 through PCI bridge 308. PCI bridge 308 also may include an integrated memory controller and cache memory for processor 302. Additional connections to PCI local bus 306 may be made through direct component interconnection or through add-in boards. In the depicted example, local area network (LAN) adapter 310, SCSI host bus adapter 312, and expansion bus interface 314 are connected to PCI local bus 306 by direct component connection. In contrast, audio adapter 316, graphics adapter 318, and audio/video adapter 319 are connected to PCI local bus 306 by add-in boards inserted into expansion slots. Expansion bus interface 314 provides a connection for a keyboard and mouse adapter 320, modem 322, and additional memory 324. SCSI host bus adapter 312 provides a connection for hard disk drive 326, tape drive 328, and CD-ROM drive 330. Typical PCI local bus implementations will support three or four PCI expansion slots or add-in connectors.

[0038] An operating system runs on processor 302 and is used to coordinate and provide control of various components within data processing system 300 in FIG. 3. The operating system may be a commercially available operating system such as OS/2, which is available from International Business Machines Corporation. “OS/2” is a trademark of International Business Machines Corporation. An object oriented programming system such as Java may run in conjunction with the operating system and provides calls to the operating system from Java programs or applications executing on data processing system 300. “Java” is a trademark of Sun Microsystems, Inc. Instructions for the operating system, the object-oriented programming system, and applications or programs are located on storage devices, such as hard disk drive 326, and may be loaded into main memory 304 for execution by processor 302.

[0039] Those of ordinary skill in the art will appreciate that the hardware in FIG. 3 may vary depending on the implementation. Other internal hardware or peripheral devices, such as flash ROM (or equivalent nonvolatile memory) or optical disk drives and the like, may be used in addition to or in place of the hardware depicted in FIG. 3. Also, the processes of the present invention may be applied to a multiprocessor data processing system.

[0040] For example, data processing system 300, if optionally configured as a network computer, may not include SCSI host bus adapter 312, hard disk drive 326, tape drive 328, and CD-ROM 330, as noted by dotted line 332 in FIG. 3 denoting optional inclusion. In that case, the computer, to be properly called a client computer, must include some type of network communication interface, such as LAN adapter 310, modem 322, or the like. As another example, data processing system 300 may be a stand-alone system configured to be bootable without relying on some type of network communication interface, whether or not data processing system 300 comprises some type of network communication interface. As a further example, data processing system 300 may be a Personal Digital Assistant (PDA) device which is configured with ROM and/or flash ROM in order to provide non-volatile memory for storing operating system files and/or user-generated data.

[0041] The depicted example in FIG. 3 and above-described examples are not meant to imply architectural limitations.

[0042] With reference now to FIG. 4, an illustration of a scan policy is depicted in accordance with a preferred embodiment of the present invention. Scan policy 400 is a data structure sent to an agent to initiate a scan of a resource. Scan policy 400 identifies the type of scan that an agent is to make. For example, scan policy 400 includes target field 402, which contains a URL of the target that is to be scanned. Maximum number of files field 404 identifies how many files should be retrieved from the resource. These files may be, for example, web pages, programs, video files, and audio files. Maximum number of levels field 406 identifies the maximum number of levels in the web site to scan. More specifically, this field identifies the number of levels within a web site that should be accessed by an agent.

[0043] Additionally, a maximum timeout may be set in scan policy 400 through max time out field 408 to terminate the scan if too much time is being taken. This is especially useful in the instance in which a file or level does not result in a response. In addition, scheduling information also may be present within the scan policy. The information may be specified in schedule field 410 to indicate that a scan should occur at a set time. The time may be immediately, periodically, or at a number of times pre-set within this field. In addition, start dates may be identified within the scan policy. Further, the server to which the scan result is to be returned also may be placed within the scan policy. Of course, scan policy 400 may include other fields in addition to or in place of those illustrated. For example, a return field identifying the computer to which the results should be returned may be specified in scan policy 400. This field may be, for example, a DNS name or an IP address.
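
As a minimal sketch only, scan policy 400 could be represented as the following record. The class name `ScanPolicy`, the attribute names, the choice of seconds as the timeout unit, the free-form string used for the schedule, and the validation rules are all assumptions made for illustration; FIG. 4 does not prescribe an encoding.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScanPolicy:
    """Illustrative stand-in for scan policy 400 (names are hypothetical)."""

    target: str                      # target field 402: URL to be scanned
    max_files: int                   # field 404: maximum number of files
    max_levels: int                  # field 406: maximum number of levels
    max_timeout_s: float             # field 408: unit of seconds is assumed
    schedule: str = "immediately"    # field 410: free-form text is assumed
    return_to: Optional[str] = None  # return field: DNS name or IP address

    def __post_init__(self):
        # Reject limits under which no scan could take place.
        if self.max_files < 1 or self.max_levels < 1:
            raise ValueError("max_files and max_levels must be at least 1")
        if self.max_timeout_s <= 0:
            raise ValueError("max_timeout_s must be positive")


# Example values only; the host name and address are placeholders.
policy = ScanPolicy(
    target="http://www.example.com/",
    max_files=50,
    max_levels=3,
    max_timeout_s=30.0,
    return_to="192.0.2.10",
)
```

Making the record immutable is a design choice of the sketch: the same policy can then be sent unchanged to every agent taking part in a scan.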

[0044] Turning next to FIG. 5, a diagram illustrating a data structure returning results from a scan is depicted in accordance with a preferred embodiment of the present invention. Results 500 is a data structure, which is returned by an agent for analysis. Results 500 includes a URL field 502, which contains the URL scanned by the agent. In addition, a file size field 504 is used to indicate the size of the file retrieved from the URL. Download time field 506 is used to identify the amount of time needed to download the file. Type of file field 508 is used to identify the file type, such as whether the file is an executable file, a text file, a graphics file, or an audio file.

[0045] Results 500 also includes a DNS resolution time field 510, which identifies the amount of time taken to resolve the DNS name when the agent accessed the web site. TCP/IP header field 512 identifies the TCP/IP header. Some information is extracted from the TCP and IP headers of each response from the server, and is saved in TCP/IP header field 512. The IP header information saved includes: the time-to-live (TTL) value; the source IP address; and the destination IP address. The TCP header information saved includes: the source port number; and the destination port number. Type of service field 514 indicates the TCP/IP based service. It contains the TCP/IP port number of the service. Connection type field 518 provides information as to the connection used by the agent to scan the web site. For example, this field may include the type of connection along with the speed of the connection (e.g., Ethernet, 64 k).

[0046] The illustrated fields in results 500 are not intended as limitations as to the type of information that may be returned by an agent for analysis. For example, other information, such as network round trip time, HTTP errors, DNS errors, and modem failures may be recorded by an agent and returned in results 500.
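
For illustration, results 500 might be modeled as one record per URL as sketched below. The class name `ScanResult`, the attribute names, the units (bytes and seconds), and the derived `throughput` helper are assumptions introduced here; only the fields themselves are taken from the description of FIG. 5.

```python
from dataclasses import dataclass


@dataclass
class ScanResult:
    """Illustrative stand-in for results 500 (names and units are assumed)."""

    url: str                      # URL field 502
    file_size: int                # file size field 504, assumed in bytes
    download_time_s: float        # download time field 506, assumed in seconds
    file_type: str                # type of file field 508
    dns_resolution_time_s: float  # DNS resolution time field 510
    ttl: int                      # TCP/IP header field 512: IP time-to-live
    source_ip: str                # field 512: source IP address
    destination_ip: str           # field 512: destination IP address
    source_port: int              # field 512: TCP source port
    destination_port: int         # field 512: TCP destination port
    service_port: int             # type of service field 514
    connection_type: str          # connection type field 518

    def throughput(self):
        """Bytes per second, a derived figure not present in FIG. 5."""
        if self.download_time_s <= 0:
            return 0.0
        return self.file_size / self.download_time_s


# Placeholder values; the addresses come from documentation ranges.
sample = ScanResult(
    url="http://www.example.com/index.html",
    file_size=10_000,
    download_time_s=0.5,
    file_type="text",
    dns_resolution_time_s=0.04,
    ttl=52,
    source_ip="192.0.2.80",
    destination_ip="198.51.100.7",
    source_port=80,
    destination_port=49152,
    service_port=80,
    connection_type="Ethernet",
)
```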

[0047] With reference now to FIG. 6, a flowchart of a process used by a server to initiate a scan of a web site or other target server is depicted in accordance with a preferred embodiment of the present invention. The process begins by making a connection to the selected agent (step 600). In this step, the server establishes a communications link with agent computers that are to initiate scans. These connections may be made using TCP/IP, but could be through any communications protocol. After the connection to the selected agent computers is made, a scan policy, such as scan policy 400 in FIG. 4, is sent to the agent (step 602). In accordance with a preferred embodiment of the present invention, the scan may be made by agents at as many different locations as possible. The number of agents that are sent a scan policy is at least two and may be more depending on the number of available agents and on the type of data to be returned from the scan.

[0048] Thereafter, results from a scan are received (step 604), and the results are stored (step 606). A determination is then made as to whether all of the results from scans made by the agents have been received (step 608). If all of the results have not been received, the process returns to step 604. Once all of the results have been received, the results are analyzed to identify differences between the various scans (step 610). Then, problems, if any, are identified (step 612). An action is then identified and initiated (step 614) with the process terminating thereafter. This action may take a number of different forms, from providing an error message to an administrator to initiating a corrective action. In addition, the action taken may be to send a policy to one or more agents to perform a scan to collect additional or different information from the resource identified as having a problem. This additional information may be used to further identify the problem or to identify corrective action needed to resolve the problem.
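
The server-side flow of FIG. 6 may be sketched as follows, purely as an illustration. Agents are modeled as in-process callables that report through a callback, which stands in for the communications link; the names `initiate_scan`, `make_agent`, and `spread`, the dictionary-based policy and results, and the `timeout_s` parameter are hypothetical.

```python
from queue import Queue


def initiate_scan(agents, policy, analyze, timeout_s=5.0):
    """Send `policy` to every agent, collect all results, then analyze them."""
    inbox = Queue()
    for name, agent in agents.items():
        # Steps 600 and 602: contact the agent and hand it the scan policy.
        agent(policy, lambda result, name=name: inbox.put((name, result)))

    stored = {}
    while len(stored) < len(agents):             # step 608: all received?
        name, result = inbox.get(timeout=timeout_s)  # step 604: receive
        stored[name] = result                        # step 606: store
    # Steps 610 to 614 are delegated to the supplied analysis function.
    return analyze(stored)


def make_agent(download_time_s):
    """Build a stand-in agent that reports a fixed download time."""
    def agent(policy, report):
        report({"url": policy["target"], "download_time_s": download_time_s})
    return agent


def spread(stored):
    """Difference between the slowest and the fastest reported download."""
    times = [result["download_time_s"] for result in stored.values()]
    return max(times) - min(times)


agents = {"east": make_agent(0.5), "west": make_agent(45.0)}
gap = initiate_scan(agents, {"target": "http://www.example.com/"}, spread)
```

In this sketch, `inbox.get` raises `queue.Empty` if an agent fails to report within `timeout_s`; how a missing result should be treated is left open by the flowchart and is therefore not decided here.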

[0049] With reference now to FIG. 7, a high-level flowchart of a process employed by an agent to scan a target web site for a server is depicted in accordance with a preferred embodiment of the present invention. The process begins by receiving a policy from a server (step 700). This policy is received when a scan is to be initiated. In response to receiving the policy, a scan is initiated by the agent on a selected resource, such as, for example, a web site (step 702). This scan includes, for example, accessing the resource, requesting data from the resource, and collecting data on the response made by the resource. This scan will take place using the protocol used by the resource. Thereafter, the results of the scan are received by the agent (step 704). The received results are then returned to the server (step 706) with the process terminating thereafter. Of course, depending on the implementation, the server sending the policy to the agent may be a different server from the server that is to receive and analyze the results.
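
One possible shape of the agent-side scan of step 702 is sketched below, under several assumptions: the web site is replaced by an in-memory table so that the example runs without a network, the home page is counted as level 1, links are followed breadth first, and the maximum timeout is omitted for brevity. The names `scan_site`, `fetch`, and `SITE` are hypothetical.

```python
from collections import deque


def scan_site(fetch, start_url, max_files, max_levels):
    """Traverse links from `start_url`, honoring the two policy limits.

    `fetch(url)` must return a (file_size, links) pair. One result record
    is produced per URL retrieved.
    """
    results = []
    seen = {start_url}
    pending = deque([(start_url, 1)])  # the home page is assumed to be level 1
    while pending and len(results) < max_files:
        url, level = pending.popleft()
        file_size, links = fetch(url)
        results.append({"url": url, "file_size": file_size, "level": level})
        if level < max_levels:
            for link in links:
                if link not in seen:   # never retrieve the same URL twice
                    seen.add(link)
                    pending.append((link, level + 1))
    return results


# A toy site: URL -> (size in bytes, outgoing links).
SITE = {
    "/": (1200, ["/a", "/b"]),
    "/a": (800, ["/c"]),
    "/b": (500, ["/"]),
    "/c": (300, []),
}

two_levels = [r["url"] for r in scan_site(SITE.__getitem__, "/", 10, 2)]
two_files = [r["url"] for r in scan_site(SITE.__getitem__, "/", 2, 3)]
```

With a limit of two levels the scan stops at the pages linked from the home page, and with a limit of two files it stops after the second retrieval, whichever limit is reached first.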

[0050] The problems that may be identified by a server using the process illustrated in FIG. 6 to analyze data returned by an agent using the process illustrated in FIG. 7 include problems with the communications link, problems associated with heavy traffic, and problems with a domain name system (DNS) server, which is a computer that answers domain name service queries.

[0051] In addition, problems may occur in firewalls. Other areas for problems include permissions, authorizations, and access control rights on a server. In addition, various server configurations or machine configurations may cause problems on a web site.

[0052] With reference now to FIG. 8, a flowchart of a process for analyzing problems on web sites is depicted in accordance with a preferred embodiment of the present invention. The process begins by determining whether a problem is present (step 800). This determination may be made by identifying whether variations are present between results from the scans made by the agents. In particular, a determination is made as to whether the variations are statistically significant. If a problem is present, then the problem is classified (step 802). The classification of the problem includes analyzing the results with thresholds and using statistical analysis. Thereafter, differences between normal results and abnormal results are identified (step 804). This step involves correlating the results of the scans made by different agents to identify common attributes. This step is used to address problem scans from different locations in which some scans may be okay while other scans are abnormal. The abnormal scans are compared to various scans for normal situations and abnormal situations to identify problems based on common attributes with different categories. With reference again to step 800, if a problem is not present, the process terminates.

[0053] With reference now to FIG. 9, a flowchart of a process for statistical analysis of a data stream is depicted in accordance with a preferred embodiment of the present invention. The process begins by performing a statistical analysis on the data stream (step 900). Based on this statistical analysis, a determination is made as to whether an unexpected variation is present from different agents (step 902). This identification of whether a variation is unexpected may be made by comparing the variation with historical data. If an unexpected variation is present, then a determination is made as to whether the variation is significant (step 904). Over time, the amount of data gathered from scans of the same site from the same agent lends itself to statistical analysis, which can help in further determining problems. For example, heuristics can be applied in establishing variation thresholds from either the mean or the median of a type of data. Transactions with values for that data type, which fall outside of those thresholds, will then be considered anomalous, and will be further analyzed. This identification of whether a variation is significant is typically determined by using statistical analysis of the variation and comparing the variation with historical data. Alternatively, the variation may be compared to a threshold, which also may be based on historical data. A significant variation may include a broken link in which no communication or result is returned for a scan from an agent. Alternatively, a significant variation may occur if a download from one agent takes one half of a second while a download from another agent takes forty-five seconds.

[0054] Thereafter, a classification of the results is made using a common set of attributes that have been identified for different situations (step 906) with the process terminating thereafter. This classification uses attributes that have been identified for different problems, such as a failed link, heavy traffic, access rights failures, and server malfunctions. Based on which attributes are matched to the scans with significant or unexpected variations, the problem may be identified and possible corrective action may be taken.

[0055] With reference again to step 904, if the variation is not significant, the process also terminates. The process again terminates if a variation from different agents is not present in step 902.
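
The threshold test of steps 902 and 904 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function name `flag_anomalies`, the use of the median absolute deviation as the spread measure, and the `factor` parameter are all assumptions chosen for illustration; the text above only states that thresholds may be derived from the mean or the median of historical data.

```python
from statistics import median

def flag_anomalies(samples, history, factor=3.0):
    """Flag agents whose measurement departs from the historical median.

    samples: dict of agent name -> measured value (None when no result came back)
    history: list of earlier measurements of the same data type
    factor:  hypothetical multiplier applied to the median absolute deviation
    """
    center = median(history)
    # Median absolute deviation of the historical data (illustrative choice).
    mad = median(abs(x - center) for x in history)
    # Fall back to 10% of the median when the history shows no spread.
    tolerance = factor * mad if mad > 0 else 0.1 * abs(center)

    flagged = {}
    for agent, value in samples.items():
        if value is None:
            # No communication or result returned, e.g. a broken link.
            flagged[agent] = "no_response"
        elif abs(value - center) > tolerance:
            flagged[agent] = "significant_variation"
    return flagged

# Historical download times in seconds, then one scan from three agents.
history = [0.4, 0.5, 0.6, 0.7, 0.5]
scan = {"agent_a": 0.5, "agent_b": 45.0, "agent_c": None}
print(flag_anomalies(scan, history))
```

In this sketch, the half-second download stays within the tolerance, while the forty-five second download and the missing result are both flagged for further analysis.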

[0056] DNS resolution tables are an example of a target resource that may be scanned by agents located at different locations in a network. A DNS server keeps a database, e.g., a DNS resolution table, of host computers and the corresponding IP addresses. When presented with a name, such as IBM.Com, for example, the DNS server would return the IP address of the company. DNS is a system by which hosts on the Internet have both domain name addresses and Internet Protocol (IP) addresses. The domain name address is used by human users and is automatically translated into a numerical IP address corresponding to the domain name address, which is used by the packet routing software.

[0057] From a single agent, the DNS resolution table may seem okay, but multiple scans from multiple agents located at different sites in a network may show a problem. A comparative analysis of the results from DNS resolutions from different agents will show whether the scan behaves much differently from different locations. The variation in DNS resolutions may be very apparent with a broken link. Such a problem is immediately visible. If one agent has a problem, this may indicate a problem with a subnet. Multiple agents having problems may indicate problems of the following types: a common DNS server that is accessed by the agents has problems; the DNS server may be down or its network connection is broken; or the DNS server's host name-to-IP address table may not be up-to-date, or may contain errors, thus returning invalid IP addresses.
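
The comparative analysis described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the method as claimed. The function `compare_dns_results`, the diagnosis labels, and the majority-vote rule for deciding which answer is treated as the reference are hypothetical. The IP addresses come from the reserved documentation range.

```python
from collections import Counter

def compare_dns_results(results):
    """Compare name-to-IP resolutions reported by agents at different locations.

    results: dict of agent name -> resolved IP string, or None when resolution failed
    """
    answers = Counter(ip for ip in results.values() if ip is not None)
    failed = [agent for agent, ip in results.items() if ip is None]
    if not answers:
        return {"diagnosis": "all_agents_failed", "agents": sorted(failed)}

    # Assumption: the answer seen by most agents is taken as the reference.
    majority_ip, _ = answers.most_common(1)[0]
    disagreeing = [agent for agent, ip in results.items()
                   if ip is not None and ip != majority_ip]
    suspects = sorted(failed + disagreeing)

    if not suspects:
        diagnosis = "consistent"
    elif len(suspects) == 1:
        # A single agent with a problem may point to that agent's subnet.
        diagnosis = "possible_subnet_problem"
    else:
        # Several agents with problems may point to a shared DNS server.
        diagnosis = "possible_dns_server_problem"
    return {"diagnosis": diagnosis, "agents": suspects}

print(compare_dns_results(
    {"agent_a": "192.0.2.10", "agent_b": "192.0.2.10", "agent_c": None}
))
```

A majority vote is only one way to pick the reference answer; a deployment could instead compare against a known-good address held by the analyzing server.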

[0058] In another example, if all of the agents find some performance problem, the problem may be tied to the target server. If, however, the agents scanning a web site see different problems, congestion may be the cause of particular problems in part of the network.
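
The distinction drawn in this example can be expressed as a small decision rule. The Python sketch below is hypothetical: the function name `localize_problem`, the string labels, and the choice to treat "all agents report the same problem" as the server-side case are assumptions made for illustration.

```python
def localize_problem(agent_findings):
    """Guess where a problem lies from per-agent findings.

    agent_findings: dict of agent name -> problem label string,
                    or None when the agent saw no problem
    """
    findings = list(agent_findings.values())
    if not any(findings):
        return "no_problem_observed"
    # Every agent reports the same problem: the target server is suspected.
    if all(findings) and len(set(findings)) == 1:
        return "target_server_suspected"
    # Agents disagree: part of the network (e.g. congestion) is suspected.
    return "network_path_suspected"

print(localize_problem({"agent_a": "slow_download", "agent_b": "slow_download"}))
print(localize_problem({"agent_a": "slow_download", "agent_b": None}))
```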

[0059] With reference now to FIG. 10, a flowchart of a process for analyzing DNS performance is depicted in accordance with a preferred embodiment of the present invention. This process is performed to identify results from DNS performance based on scans made by agents. The process begins by analyzing the DNS performance to determine whether a strange or abnormal DNS resolution was encountered by an agent (step 1000). Such an occurrence may indicate a problem with a subnet. A subnet is a portion of a network in which computers in the subnet show the same network attributes, such as, for example, the same prefix in an IP address. Essentially, a subnet is a group of machines that have access to the same network resources.

[0060] Thereafter, the DNS resolution is analyzed (step 1002). In analyzing the DNS resolution, a determination is made as to whether a resolution from a name to an IP address occurred. This determination may be made by identifying agents at different locations as to whether all of the locations returned a proper result. An improper DNS resolution at one site may indicate a problem with a subnet or particular geographic location. This kind of test may involve agents both in an Internet and an Extranet.

[0061] With reference now to FIG. 11, a flowchart of a process employed by an agent to scan a web site is depicted in accordance with a preferred embodiment of the present invention. This process is also referred to as a scanning process or a web crawling process. The scan begins by collecting statistics at the root universal resource locator (URL) (step 1100). The information obtained from a scan may include the URL, the size of the document at the URL, the download time for the document, DNS resolution time, TCP/IP header information, the type of document, the type of service, and the connection type. The type of document identified also may indicate whether the document is executable, a text or graphic document, or a combination of the two. The connection type may include identification of the type of connection along with the speed. Next, the policy is examined for length, depth, and the number of documents to be retrieved (step 1102). A determination is then made as to whether the number of links is equal to that identified by the policy (step 1104). If the number of links is not equal to the number set by the policy, then a determination is made as to whether the number of documents to be retrieved equals that set by the policy (step 1106). If the number of documents has not been reached, then the next URL is identified (step 1108).

[0062] Thereafter, statistics are collected for the URL (step 1110) with the process then returning to step 1104. If the number of links is equal to that set by the policy in step 1104, or the number of documents retrieved is equal to that set by the policy in step 1106, then the process will send the result identified in the policy (step 1112) with the process terminating thereafter.
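
The policy-bounded scanning loop of FIG. 11 can be approximated by the following Python sketch. It rests on several assumptions that are not in the specification: the policy is modeled as a dictionary with hypothetical keys `max_depth` and `max_documents`, the link limit of step 1104 is interpreted as a link depth, traversal is breadth-first, and `fetch` is a caller-supplied stand-in for the actual retrieval and statistics gathering.

```python
from collections import deque

def scan_site(root_url, fetch, policy):
    """Crawl from root_url until a policy limit is reached.

    fetch:  callable url -> (stats dict, list of linked URLs); hypothetical stand-in
    policy: dict with 'max_depth' and 'max_documents' (assumed field names)
    """
    results = []
    seen = {root_url}
    queue = deque([(root_url, 0)])  # the root URL is scanned first (step 1100)

    # Stop once the document limit is reached (step 1106) or no URLs remain.
    while queue and len(results) < policy["max_documents"]:
        url, depth = queue.popleft()          # identify the next URL (step 1108)
        stats, links = fetch(url)             # collect statistics (step 1110)
        results.append({"url": url, **stats})
        # Follow links only while the depth limit allows it (step 1104).
        if depth < policy["max_depth"]:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return results                            # result to be sent (step 1112)

# A tiny in-memory site used in place of real network access.
site = {"/": ["/a", "/b"], "/a": ["/c"], "/b": [], "/c": []}
fake_fetch = lambda url: ({"size": len(url)}, site[url])
print([r["url"] for r in scan_site("/", fake_fetch, {"max_depth": 1, "max_documents": 10})])
```

Passing `fetch` as a parameter keeps the sketch runnable without network access; a real agent would record items such as download time and DNS resolution time in the statistics dictionary.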

[0063] It is important to note that while the present invention has been described in the context of a fully functioning data processing system, those of ordinary skill in the art will appreciate that the processes of the present invention are capable of being distributed in the form of a computer readable medium of instructions and a variety of forms and that the present invention applies equally regardless of the particular type of signal bearing media actually used to carry out the distribution. Examples of computer readable media include recordable-type media such as a floppy disc, a hard disk drive, a RAM, and CD-ROMs and transmission-type media such as digital and analog communications links.

[0064] The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the invention, the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

What is claimed is:
 1. A method for identifying problems associated with a web site comprising the data processing system implemented steps of: initiating a scan of a web site by a plurality of agents, wherein each of the plurality of agents are at a different location in the distributed data processing system; obtaining results of the scan from the plurality of agents; and analyzing the results of the scan to determine if a problem is associated with the web site.
 2. The method of claim 1, wherein the step of obtaining includes: receiving, from each of the plurality of agents, data about data streams resulting from a scan of the web site.
 3. The method of claim 2, wherein the step of analyzing includes: comparing the data about data streams resulting from a scan of the web site to identify variations from different agents within the plurality of agents.
 4. The method of claim 3 further comprising: comparing the data about data streams to prior data about data streams generated from scanning the web site.
 5. The method of claim 3 further comprising: responsive to identifying variations from different agents, comparing the data about data streams to at least one threshold, wherein the threshold indicates a level at which a problem is present.
 6. The method of claim 5, wherein the threshold is derived from historical data generated from prior scans of the web site.
 7. The method of claim 5 further comprising: comparing the data about data streams with attributes associated with a plurality of classifications; responsive to comparing the data, associating the data about data streams with classifications.
 8. The method of claim 1, wherein the step of initiating includes: sending a policy to each of the plurality of agents, wherein the policy identifies the web site to be scanned.
 9. The method of claim 1, wherein the initiating step is performed by a first computer while the obtaining and analyzing steps are performed by a second computer.
 10. The method of claim 1, wherein each of the plurality of agents is located on a different subnet.
 11. The method of claim 1, wherein each of the plurality of agents is located in a different geographic location.
 12. A method for scanning a target resource in a distributed data processing system, the method comprising the computer implemented steps of: receiving a plurality of results from scans of a target resource made by a plurality of agents, wherein the plurality of agents are located at different locations on the distributed data processing system; and analyzing the plurality of results for differences between the plurality of results.
 13. The method of claim 12 further comprising: initiating the scans of the target resource by the plurality of agents.
 14. The method of claim 12, wherein the steps of initiating, receiving, and analyzing are performed by a single computer.
 15. The method of claim 12, wherein the step of initiating is performed by a first computer while the steps of receiving and analyzing are performed by a second computer.
 16. The method of claim 12, wherein the plurality of agents is a plurality of computers.
 17. The method of claim 12, wherein the plurality of agents includes a Personal Digital Assistant device.
 18. The method of claim 12, wherein the target resource is a computer.
 19. The method of claim 12, wherein the target resource is a web site.
 20. The method of claim 12, wherein the target resource is a file transfer protocol site.
 21. The method of claim 12, wherein the plurality of results includes time needed to resolve a domain name address.
 22. The method of claim 12, wherein the plurality of results includes time needed to download a document.
 23. The method of claim 12, wherein the plurality of results includes time needed to resolve a domain name address.
 24. The method of claim 12 further comprising: correlating results of the analysis to identify common attributes.
 25. A method for scanning a computer system connected to a network, the method comprising: performing a scanning process from a first network location and from a second network location, wherein a first set of scan results and a second set of scan results are generated; comparing the first set of scan results with the second set of scan results for differences to detect network related problems; and performing a given action based on differences detected from comparing the first set of scan results with the second set of scan results.
 26. The method of claim 25, wherein differences detected from comparing the first set of scan results with the second set of scan results are compared to a set of heuristics linked to respective actions and a given heuristic is detected.
 27. The method of claim 25, wherein the given action is performing a different type of scanning process from at least one of the first network location and the second network location.
 28. A distributed data processing system comprising: a network, wherein the network provides a medium to establish communications links; a target computer; a server connected to the network, wherein the server analyzes results of tests to determine whether a problem associated with accessing the target computer exists; and a plurality of agents connected to the network, wherein the plurality of agents perform tests on the target computer, collect results from the tests, and send the results to the server.
 29. The distributed data processing system of claim 28, wherein the server is a first server and further comprising: a second server, wherein the second server initiates commencement of the tests by the plurality of agents on the target computer.
 30. The method of claim 29, wherein the tests are initiated periodically.
 31. The distributed data processing system of claim 28, wherein each agent in the plurality of agents is located on a different subnet from another agent with the plurality of agents.
 32. A data processing system for identifying problems associated with a web site comprising: initiating means for initiating a scan of a web site by a plurality of agents, wherein each of the plurality of agents are at a different location in the distributed data processing system; obtaining means for obtaining results of the scan from the plurality of agents; and analyzing means for analyzing the results of the scan to determine if a problem is associated with the web site.
 33. The data processing system of claim 32, wherein the step of obtaining includes: receiving means for receiving, from each of the plurality of agents, data about data streams resulting from a scan of the web site.
 34. The data processing system of claim 32, wherein the step of analyzing includes: comparing means for comparing the data about data streams resulting from a scan of the web site to identify variations from different agents within the plurality of agents.
 35. The data processing system of claim 34 further comprising: comparing means for comparing the data about data streams to prior data about data streams generated from scanning the web site.
 36. The data processing system of claim 34 further comprising: comparing means, responsive to identifying variations from different agents, for comparing the data about data streams to at least one threshold, wherein the threshold indicates a level at which a problem is present.
 37. The data processing system of claim 36, wherein the threshold is derived from historical data generated from prior scans of the web site.
 38. The data processing system of claim 36 further comprising: comparing means for comparing the data about data streams with attributes associated with a plurality of classifications; associating means, responsive to comparing the data, for associating the data about data streams with classifications.
 39. The data processing system of claim 32, wherein the step of initiating includes: sending means for sending a policy to each of the plurality of agents, wherein the policy identifies the web site to be scanned.
 40. The data processing system of claim 32, wherein the initiating step is performed by a first computer while the obtaining and analyzing steps are performed by a second computer.
 41. The data processing system of claim 32, wherein each of the plurality of agents is located on a different subnet.
 42. The data processing system of claim 32, wherein each of the plurality of agents is located in a different geographic location.
 43. A data processing system for scanning a target resource in a distributed data processing system, the data processing system comprising: receiving means for receiving a plurality of results from scans of a target resource made by a plurality of agents, wherein the plurality of agents are located at different locations on the distributed data processing system; and analyzing means for analyzing the plurality of results for differences between the plurality of results.
 44. The data processing system of claim 43 further comprising: initiating means for initiating the scans of the target resource by the plurality of agents.
 45. The data processing system of claim 43, wherein the steps of initiating, receiving, and analyzing are performed by a single computer.
 46. The data processing system of claim 43, wherein the step of initiating is performed by a first computer while the steps of receiving and analyzing are performed by a second computer.
 47. The data processing system of claim 43, wherein the plurality of agents is a plurality of computers.
 48. The data processing system of claim 43, wherein the plurality of agents includes a Personal Digital Assistant device.
 49. The data processing system of claim 43, wherein the target resource is a computer.
 50. The data processing system of claim 43, wherein the target resource is a web site.
 51. The data processing system of claim 43, wherein the target resource is a file transfer protocol site.
 52. The data processing system of claim 43, wherein the plurality of results includes time needed to resolve a domain name address.
 53. The data processing system of claim 43, wherein the plurality of results includes time needed to download a document.
 54. The data processing system of claim 43, wherein the plurality of results includes time needed to resolve a domain name address.
 55. The data processing system of claim 43 further comprising: correlating means for correlating results of the analysis to identify common attributes.
 56. A data processing system for scanning a computer system connected to a network, the data processing system comprising: first performing means for performing a scanning process from a first network location and from a second network location, wherein a first set of scan results and a second set of scan results are generated; comparing means for comparing the first set of scan results with the second set of scan results for differences to detect network related problems; and second performing means for performing a given action based on differences detected from comparing the first set of scan results with the second set of scan results.
 57. The data processing system of claim 56, wherein differences detected from comparing the first set of scan results with the second set of scan results are compared to a set of heuristics linked to respective actions and a given heuristic is detected.
 58. The data processing system of claim 56, wherein the given action is performing a different type of scanning process from at least one of the first network location and the second network location.
 59. A computer program product in a computer readable medium for identifying problems associated with a web site comprising: first instructions for initiating a scan of a web site by a plurality of agents, wherein each of the plurality of agents are at a different location in the distributed data processing system; second instructions for obtaining results of the scan from the plurality of agents; and analyzing the results of the scan to determine if a problem is associated with the web site.
 60. A computer program product in a computer readable medium for scanning a target resource in a distributed data processing system, the computer program product comprising: first instructions for receiving a plurality of results from scans of a target resource made by a plurality of agents, wherein the plurality of agents are located at different locations on the distributed data processing system; and second instructions for analyzing the plurality of results for difference between the plurality of results.
 61. A computer program product in a computer readable medium for scanning a computer system connected to a network, the computer program product comprising: first instructions for performing a scanning process from a first network location and from a second network location, wherein a first set of scan results and a second set of scan results are generated; second instructions for comparing the first set of scan results with the second set of scan results for differences to detect network related problems; and third instructions for performing a given action based on differences detected from comparing the first set of scan results with the second set of scan results.